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ABSTRACT. In this overview article we present several methods for multiplying matrices and the implementa- 
tion of these methods in C. Also a little test program is given to compare their running time and the numerical 
stability. 

The methods are: naive method, naive method working on arrays, naive method with the KAHAN trick, 
three methods with loop unrolling, winograd method and the scaled variant, original STRASSEN method and 
the STRASSEN-WlNOGRAD variant. 



Please note, that this is the FIRST version. The algorithms are not well tested and the implemen- 
tation is not optimized. If you like to join the project, please contact me. 

The collection of algorithms and this document will be updated from time to time. 
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In the following text, A denotes a N x P matrix and B denotes a P x M matrix with entries of type 
double. The implementation is written in C based on the standard C99. The code and the current version 
of this document can be found here: 

http : / / www 2 . inf ormatik . uni-halle . de/ da/hedtke/ overview-momm/. 

Feel free to use the code for anything you want. DO NOT USE COMPILER OPTIMIZATION! 
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1. Auxiliary Routines 



1.1. Plus. According to the definition of the + operator for matrices 



we use a simple straightforward method to compute A + B. 

void Plus (A, B, Result, N, M) 
double** A; 
double** B; 
double** Result; 
int N; 
int M; 



for (int i = 0; i < N; i++) { 
for (int j = 0; j < M; j++) { 

Result[i][j] = A[i] [j] +B[i][j]; 

} 

} 



1 .2. Minus. Like above we use a straightforward method to compute 



void Minus (A, B, Result, N, M) 
double** A; 
double** B; 
double** Result; 
int N; 
int M; 



for (int i = 0; i < N; i++) { 
for (int j = 0; j < M; j++) { 

Result[i][j] = A[i] [j] -B[i][j]; 

} 

} 



1.3. max. As an often used auxiliary function we implement 



for integer and double input arguments. 

int max ( x, y ) 
int x; 
int y; 

{ 

if (x < y) { 

return y; 
} else { 

return x; 

} 

} 



(A + B)ij = Aij + Bij 



(A B)ij — A^ B^. 
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double Max ( x, y ) 
double x; 
double y; 

{ 

if (x < y) { 

return y; 
} else { 

return x; 

} 

} 



1.4. Normlnf. In the scaled variant of WlNOGRADs algorithm we need the oo-norm 



of a matrix A. We also use it to compare the numerical stability of the several methods in the tests. 

double Abs (double a) { 

if ( a < ) { 
return -a; 

} 

return a; 



double Norm = 0.0; 
double aux; 
int i, j; 

for (i = 0; i < N; i++) { 
aux = 0.0; 

for (j =0; j < P; j++) { 
aux += Abs (A[i] [ j] ) ; 

} 

Norm = Max (Norm, aux) ; 



1.5. MultiplyWithScalar. Also in the scaled variant of WlNOGRADs algorithm we need the product of a 
matrix A and a scalar a 



void MultiplyWithScalar (A, Result, N, M, alpha) 
double** A; 
double** Result; 
int N; 
int M; 

double alpha; 




double Normlnf (A, 
double** A; 
int N; 
int P; 



N, P) 



return Norm; 



(aA)ij = aAij. 
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{ 



for (int i = 0; i 
for (int j = 0; 

Result [i] [j] 



< N; i++) { 
j < M; j++) { 
= alpha*A [ i ] [ j ] ; 
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2. Methods of Matrix Multiplication 
2. 1 . The naive method. The naive method of matrix multiplication is given by the definiton itself: 

p 

{AB)ij = AjkBkj. 

k=l 

2.1.1. NaivStandard. First we implement this straightforward but we work with an auxiliary variable aux 
for the sum. 



void NaivStandard (A, B, Result, N, P, M) 
double** A; 
double** re- 
double** Result; 
int N; 
int P; 
int M; 

{ 

double aux; 

for (int i = 0; i < N; i++) { 
for (int j = 0; j < M; j++) { 
aux = 0.0; 

for (int k = 0; k < P; k++) { 
aux += A[i] [k] *B[k] [j] ; 

} 

Result [i] [j] = aux; 

} 

} 

} 



2.1.2. NaivOnArray. A common error is to work all the time on the arrays itself instead of using an auxil- 
iary variable. We only present this method to include it in the tests. 

void NaivOnArray (A, B, Result, N, P, M) 
double** A; 
double** B; 
double** Result; 
int N; 
int P; 
int M; 

{ 

for (int i = 0; i < N; i++) { 

for (int j = 0; j < M; j++) { 
Result [i] [ j ] = 0.0; 
for (int k = 0; k < P; k++) { 

Result [i] [j] += A[i] [k]*B[k] [j]; 

} 

} 

} 



} 
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2.1.3. NaivKahan. To improve the numerical stability of the process, we use the Kahan trick (see J3]) 
for the summation. Instead of computing a sum like 



with 

double sum = 0.0; 
for( int i = 1; i <= n; i++) { 
sum += x [ i ] ; 

} 

we use 

double t; 

double sum = 0.0; 

double err = 0.0; 

for( int i = 1; i <= n; i++) { 

err = err + x [ i ] ; 

t = sum + err; 

err = (sum - t) + err; 

sum = t; 



void NaivKahan (A, B, Result, N, P, M) 
double** A; 
double** Re- 
double** Result; 
int N; 
int P; 
int M; 



double t, sum, err; 

for (int i = 0; i < N; i++) { 
for (int j = 0; j < M; j++) { 
sum = 0.0; 
err = 0.0; 

for (int k = 0; k < P; k++) { 
err = err + A [ i ] [ k] *B [ k] [ j ] ; 
t = sum + err; 
err = (sum - t) + err; 
sum = t; 

} 

Result [i] [j] = sum; 



2.2. Loop unrolling. The method of loop unrolling doesn't reduce the abstract complexity of the matrix 
mutliplication problem. It tries to reduce the overhead from the for loops and speed up the real computa- 
tional work. 

2.2.1. NaivLoopUnrollingTwo. Instead of computing 



n 




i=l 



P 



we use 
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P/2 

(AB)ij = ^2 M,ik-\Bik-\,i + Ai,2kB 2 k,j, or 
k=i 

I L-P/2J \ 

(AB)ij = [ ^2 Ai,ik-\Bik-\,3 + A it 2kB2k,j +A l . P Bpj, 

k=l J 



if P is even or odd, resp. 



void NaivLoopUnrollingTwo (A, B, Result, N, P, M) 
double** A; 
double** B; 
double** Result; 
int N; 
int P; 
int M; 

{ 

int i, j, k; 
double aux; 

if (P % 2 == 0) { 

for (i = 0; i < N; i++) { 
for (j =0; j < M; j++) { 
aux = 0.0; 

for (k = 0; k < P; k += 2 ) { 

aux += A[i] [k]*B[k] [j] + A[i] [k+l]*B[k + l] [j]; 

} 

Result [i] [j] = aux; 

} 

} 

} else { 

int PP = P - 1; 
for (i = 0; i < N; i++) { 
for (j = 0; j < M; j++) { 
aux = 0.0; 

for (k = 0; k < PP; k += 2 ) { 

aux += A[i] [k]*B[k] [j] + A[i] [k+1] *B [k+1] [j]; 

} 

Result [i] [j] = aux + A[i] [PP]*B[PP] [j]; 

} 

} 



} 

} 



2.2.2. NaivLoopUnrollingThree. Like above we compute 



P/3 

(AB)ij = ^2 Ai,3k-2B 3 k-2,j + Ai^k-iB^k-i.j + A l ,2kB 2 kj 
k=i 
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instead of 

p 

(AB)ij = ^ AjkBkj, 

k=l 

if P is divisible by 3 (otherwise we use correction terms like above). 

void NaivLoopUnrollingThree (A, B, Result, N, P, M) 
double** A; 
double** re- 
double** Result; 
int N; 
int P; 
int M; 



int i, j, k; 
double aux; 

if (P % 3 == 0) { 

for (i = 0; i < N; i++) { 
for (j =0; j < M; j++) { 
aux = 0.0; 

for (k = 0; k < P; k += 3) { 

aux += A[i] [k]*B[k] [j] + A[i] [k+1] *B [k+1] [j] + A[i] [k+2]*B[k+2] [j]; 

} 

Result [i] [j] = aux; 

} 

} 

} else if (P % 3 == 1) { 

int PP = P - 1; 
for (i = 0; i < N; i++) { 
for (j = 0; j < M; j++) { 
aux = 0.0; 

for (k = 0; k < PP; k += 3) { 

aux += A[i] [k]*B[k] [j] + A[i] [k+l]*B[k+l] [j] + A[i] [k+2]*B[k+2] [j]; 

} 

Result [i] [j] = aux + A[i] [PP]*B[PP] [j]; 

} 

} 

} else { 

int PP = P - 2; 
int PPP = P - 1; 
for (i =0; i < N; i++) { 
for (j = 0; j < M; j++) { 
aux = 0.0; 

for (k = 0; k < PP; k += 3) { 

aux += A[i] [k]*B[k] [j] + A[i] [k+l]*B[k+l] [j] + A[i] [k+2]*B[k+2] [j]; 

} 

Result [i] [j] = aux + A[i] [PP]*B[PP] [j] + A[i] [PPP]*B[PPP] [j]; 

} 

} 



} 
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2.2.3. NaivLoopUnrollingFour. 



void NaivLoopUnrollingFour (A, B, Result, N, P, M) 
double** A; 
double** re- 
double** Result; 
int N; 
int P; 
int M; 

{ 

int i , j , k ; 
double aux; 

if (P % 4 == 0) { 

for (i = 0; i < N; i++) { 
for (j =0; j < M; j++) { 
aux = 0.0; 

for (k = 0; k < P; k += 4 ) { 

aux += A[i] [k]*B[k] [j] + A[i] [k+l]*B[k + l] [j] + A[i] [k+2]*B[k+2] [j] 
+ A[i] [k+3] *B [k+3] [ j] ; 

} 

Result [i] [j] = aux; 

} 

} 

} else if (P % 4 == 1) { 

int PP = P - 1; 
for (i = 0; i < N; i++) { 
for (j =0; j < M; j++) { 
aux = 0.0; 

for (k = 0; k < PP; k += 4 ) { 

aux += A[i] [k]*B[k] [j] + A[i] [k + l]*B[k + l] [j] + A[i] [k+2]*B[k+2] [j] 
+ A[i] [k + 3] *B[k+3] [ j] ; 

} 

Result [i] [j] = aux + A[i] [PP]*B[PP] [j]; 

} 

} 

} else if (P % 4 == 2) { 

int PP = P - 2; 
int PPP = P - 1; 
for (i =0; i < N; i++) { 
for (j = 0; j < M; j++) { 
aux = 0.0; 

for (k = 0; k < PP; k += 4 ) { 

aux += A[i] [k]*B[k] [j] + A[i] [k+l]*B[k + l] [j] + A[i] [k+2]*B[k+2] [j] 
+ A[i] [k+3] *B[k+3] [j] ; 

} 

Result [i] [j] = aux + A[i] [PP]*B[PP] [j] + A[i] [PPP]*B[PPP] [j]; 

} 

} 



} else { 
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int PP = P - 3; 
int PPP = P - 2; 
int PPPP = P - 1; 
for (i = 0; i < N; i++) { 
for (j =0; j < M; j++) { 
aux = 0.0; 

for (k = 0; k < PP; k += 4 ) { 

aux += A[i] [k]*B[k] [j] + A[i] [k+1] *B [k+1] [j] + A[i] [k+2]*B[k+2] [j] 
+ A[i] [k+3] *B[k+3] [ j] ; 

} 

Result [i] [j] = aux + A[i] [PP]*B[PP] [j] + A[i] [PPP]*B[PPP] [j] 
+ A[i] [PPPP] *B [PPPP] [j]; 

} 

} 



2.3. Winograd's algorithm. 

2.3.1. WinogradOriginal. From Shmuel WlNOGRAD we know a method that precomputes some values 
and reuse it in the computation of the entries of the result matrix. It is based on the fact that 

P/2 P/2 P/2 

{AB) ik = y^(A ij2 j-l + B 2 j,k){Ai,2j + B 2j-l,k) - ^2 A iM-l A i,2j - ^2 B 2j-l,k B 2j,k 
j=l 3 = l 3 = 1 

" v ' V v ' 

— ■Vi =--Zk 

if P is even. In the case that P is odd, we use \_P/2\ instead of P/2 and add the missing product Ai.pBs,k 
to each entry of the result matrix. 

In the implementation we use upsilon to indicate if P is even and gamma as [P/2\ . 

void WinogradOriginal (A, B, Result, N, P, M) 
double** A; 
double** B; 
double** Result; 
int N; 
int P; 
int M; 



int i, j, k; 
double aux; 

int upsilon = P % 2; 
int gamma = P - upsilon; 

double* y = malloc(M*sizeof (double) ) ; 
double* z = malloc(N*sizeof (double) ) ; 

for (i = 0; i < M; i++) { 
aux = 0.0; 

for (j = 0; j < gamma; j += 2 ) { 
aux += A[i] [ j] *A[i] [ j+1] ; 

} 

y[i] = aux; 

} 
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for (i = 0; i < N; i++) { 
aux = 0.0; 

for (j = 0; j < gamma; j += 2 ) { 
aux += B[ j] [i]*B[j+l] [i] ; 

} 

z [ i ] = aux; 



if (upsilon == 1) { 



* P is odd 

* The value A [ i ] [P ] *B [P ] [k] is missing in all auxiliary sums. 
*/ 

int PP = P-l; 
for (i = 0; i < M; i++) { 
for (k = 0; k < N; k++) { 
aux = 0.0; 

for (j = 0; j < gamma; j += 2) { 

aux += ( A[i][j] + B[j + l][k] )*( A[i] [j+1] + B[j][k] ); 

} 

Result[i] [k] = aux - y[i] - z [k] + A[i] [PP]*B[PP] [k] ; 

} 

} 



} else { 



/* 

* P is even 

* The result can be computed with the auxiliary sums. 
*/ 

for (i = 0; i < M; i++) { 
for (k = 0; k < N; k++) { 
aux = 0.0; 

for (j = 0; j < gamma; j += 2) { 

aux += ( A[i] [j] + B[j+l][k] )*( A[i][j+1] + B[j][k] ); 

} 

Result [i] [k] = aux - y[i] - z [k] ; 

} 

} 

} 



free(y) ; 
free ( z ) ; 



} 



2.3.2. WinogmdScciled. FromR. P. Brent we know (see [1]) that the error of WlNOGRAD's algorithm is 

waa ~ *+™-W iBii)'. 

where r is a parameter of the underlying number modell (defined by WILKINSON in J5]) and N is size of 
the matrices (in the case that N = P = M). Now suppose that ||A||/||B|| = k, then (\\A\\ + \\B\\) 2 = 
(fc + 2 + l/fc)||A||||B||,andso 

Aj2 j_ 1 9 AT _ Q 

\\E\\ < 2- T -— A (k + 2 + 
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According to Brent it is always possible to find an integer A such that 

1/2<^I<2 



And because of 



he gets the bound 



max fc + 2 + l/fc = 9/2 

i<fe<2 

\\E\\<2-t.^.(N 2 + 12N-8)-\\A\\-\\B\\ 

when multiplying the matrices 2 X A and 2~ X B (which doesn't change the result: (2 X A) ■ (2~ X B) 
2 X 2- X AB = AB). 

void WinogradScaled (A, B, Result, N, P, M) 
double** A; 
double** re- 
double** Result; 
int N; 
int P; 
int M; 



int i ; 



/* Create scaled copies of A and B */ 
double** CopyA = malloc(N*sizeof (double* )) ; 
for( i = 0; i < N; i++) { 

CopyA [ i ] =malloc(P*sizeof (double) ) ; 

} 

double** CopyB = malloc(P*sizeof (double* )) ; 
for( i = 0; i < P; i++) { 

CopyB [i] =malloc(M*sizeof (double) ) ; 

} 



/* Scaling factors */ 
double a = Normlnf (A, N, P) ; 
double b = Normlnf (B, P, M) ; 



double lambda = floor (0.5 + log (b/a) /log ( 4 ) ) ; 
/* Scaling */ 

MultiplyWithScalar (A, CopyA, N, P, pow(2, lambda)); 
MultiplyWithScalar (B, CopyB, P, M, pow(2, -lambda)); 

/* Using Winograd with scaled matrices */ 
WinogradOriginal (CopyA, CopyB, Result, N, P, M) ; 



2.4. Strassen's algorithm. The matrix multiplication algorithm from Volker Strassen is very fa- 
mous. We present two variants. 

2.4.1. StrassenNaiv. In his paper Gaussian Elimination is not Optimal (see [4]) STRASSEN developed a 
recursive algorithm a m ,k for matrix multiplication for square matrices of order m2 k , where k, m € N. 
Let a m .o be the naive algorithm for matrix multiplication. We assume that a m> k is known, then we define 
a m .fc+i as follows: 
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If A, B are matrices of order m2 k+1 to be multiplied, we write 



(1) 



A 



'An 


Au 


, B = 


B n 


B\2 


A 2 i 


A 2 2_ 


B21 


B22 



and C := AB 



Cu 
C21 



C12 
C22 



where Aij, Bij and C,j are matrices of order m2 k . With the following auxiliary matrices 



H 7 



= (A n + A 22 )(B 11 + B 22 ) 

= An(Bi2 — B22) 

= (^11+^12)^22 

= (A 12 -A 22 )(B 21 +B22) 



H 2 
H 6 



= (A 2 i + A 22 )B n 
= A22 (B21 — B\\) 
= {A21 -A n )(Bu 



B12) 



of order m2 k computed using a m .k for matrix multiplication and the usual algorithm for addition and 
subtraction of matrices we get 



C12 — H3 + B 5 

C22 = Hi + H3 — H2 + Hq . 



Cu = Hi + H4 — H 5 + Hi 
C21 = H 2 + H4 

as the entries of the result matrix C. 

To use the algorithm above we define X := max{./V, P, M} and set k := \}og 2 X\ — 4 and m := 
[X2~ k \ + 1. Now with Y := m2 k we define matrices A and B of size Y xY. The main idea is, to embed 
the given matrices into 7x7 matrices, by adding zero rows and columns. Now we use the STRASSEN 
algorithm to compute AB and extract the result AB from it, by removing the additional rows and colums. 

In the implementation we use ResultPart=C, Auxl=i?i, Auxl=Hj, AuxResult=AB, 
MaxSize=X and NewSize=F. 



void StrassenNaiv (A, B, Result, N, P, M) 
double** A; 
double** re- 
double** Result; 
int N; 
int P; 
int M; 



int MaxSize, k, m, NewSize, i, j; 



MaxSize = max(N,P); 
MaxSize = max (MaxSize, M) ; 



if ( MaxSize < 16) { 

MaxSize = 16; // otherwise it is not possible to compute k 

} 



k = floor (log (MaxSize) /log (2) ) - 4; 
m = floor (MaxSize * pow(2,-k)) + 1; 



NewSize = m * pow(2,k); 



// add zero rows and columns to use Strassens algorithm 



double** NewA = malloc(NewSize*sizeof (double* )) ; 
double** NewB = malloc(NewSize*sizeof (double* )) ; 
double** AuxResult = malloc(NewSize*sizeof (double* )) ; 
for( i = 0; i < NewSize; i++) { 

NewA[i] = malloc(NewSize*sizeof (double) ) ; 

NewB[i] = malloc(NewSize*sizeof (double) ) ; 

AuxResult [i] = malloc(NewSize*sizeof (double) ) ; 

} 
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for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
NewA [ i ] [ j ] = 0.0; 
NewB [ i ] [ j ] = 0.0; 

} 

} 

for( i = 0; i < N; i++) { 
for ( j = 0; j < P; j++) { 
NewA[i] [ j] = A[i] [ j] ; 

} 

} 

for( i = 0; i < P; i++) { 
for ( j = 0; j < M; j++) { 
NewB[i] [ j] = B[i] [ j] ; 

} 

} 

StrassenNaivStep (NewA, NewB, AuxResult, NewSize, m) ; 

// extract the result 

for( i = 0; i < N; i++) { 
for ( j = 0; j < M; j++) { 

Result [i] [j] = AuxResult [i] [j] ; 

} 

} 



void StrassenNaivStep (A, B, Result, N, m) 
double** A; 
double** B; 
double** Result; 
int N; 

int m; //to reduce the recursion; idea from Strassen 

{ 



int i, j, NewSize; 

if ( (N % 2 == 0) && (N > m) ) { // recursive use of StrassenNaivStep 
NewSize = N / 2; 
// decompose A and B 

// create ResultPart, Auxl , . . . , Auxl and Helperl, Helper2 
double** All = malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 

malloc(NewSize*sizeof (double* ) ) ; 
malloc(NewSize*sizeof (double* ) ) ; 



double** A12 
double** A21 
double** A22 
double** Bll 
double** B12 
double** B21 
double** B22 
double** ResultPart 11 
double** ResultPartl2 
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0; 



double** Aux2 
double** Aux3 
double** Aux4 
double** Aux5 
double** Aux6 
double** Aux7 
for( i = 

All [i] 

A12 [i] 

A21 [i] 

A22 [i] 

Bll [i] 

B12 [i] 

B21 [i] 



double** ResultPart21 = malloc(NewSize*sizeof (double* ) 
double** ResultPart22 = malloc(NewSize*sizeof (double* ) 
double** Helperl = malloc(NewSize*sizeof (double* )) ; 
double** Helper2 = malloc(NewSize*sizeof (double* )) ; 
double** Auxl = malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
= malloc(NewSize*sizeof (double* ) ) 
i < NewSize; i++) { 
malloc(NewSize*sizeof (double) ) 
malloc(NewSize*sizeof (double) ) 
malloc(NewSize*sizeof (double) ) 
malloc(NewSize*sizeof (double) ) 
malloc(NewSize*sizeof (double) ) 
malloc(NewSize*sizeof (double) ) 
malloc(NewSize*sizeof (double) ) 
B22[i] = malloc(NewSize*sizeof (double) ) 
ResultPartll [i] = malloc(NewSize*sizeof (double) ) 
ResultPart 12 [ i ] = malloc(NewSize*sizeof (double) ) 
ResultPart 2 1 [ i ] = malloc(NewSize*sizeof (double) ) 
ResultPart 22 [ i ] = malloc(NewSize*sizeof (double) 
Helperl [i] = malloc(NewSize*sizeof (double) ) ; 
Helper2[i] = malloc(NewSize*sizeof (double) ) ; 
Auxlfi] = malloc(NewSize*sizeof (double) ) 
Aux2[i] = malloc(NewSize*sizeof (double) ) 
Aux3[i] = malloc(NewSize*sizeof (double) ) 
Aux4[i] = malloc(NewSize*sizeof (double) ) 
Aux5[i] = malloc(NewSize*sizeof (double) ) 
Aux6[i] = malloc(NewSize*sizeof (double) ) 
Aux7[i] = malloc(NewSize*sizeof (double) ) 

} 



// fill new matrices 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
All[i] [j] = A[i] [j]; 



} 



} 



for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
A12[i] [j] = A[i] [NewSize + j]; 

} 

} 



for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
A21[i] [j] = A[NewSize + i] [j] 

} 

} 



for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

A22[i] [j] = A[NewSize + i] [NewSize + j]; 
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} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
Bll[i] [j] = B[i] [j]; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
B12[i][j] = B[i] [NewSize + j]; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
B21 [i] [ j] = B [NewSize + i] [ j] ; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

B22 [i] [ j] = B[NewSize + i] [NewSize + j] ; 

} 

} 

// computing the seven aux . variables 

Plus (All, A22, Helperl, NewSize, NewSize); 
Plus(Bll, B22, Helper2, NewSize, NewSize); 
StrassenNaivStep (Helperl, Helper2, Auxl, NewSize, m) ; 

Plus(A21, A22, Helperl, NewSize, NewSize); 
StrassenNaivStep (Helperl, Bll, Aux2, NewSize, m) ; 

Minus (B12, B22, Helperl, NewSize, NewSize); 
StrassenNaivStep (All, Helperl, Aux3, NewSize, m) ; 

Minus (B21, Bll, Helperl, NewSize, NewSize); 
StrassenNaivStep (A22 , Helperl, Aux4, NewSize, m) ; 

Plus (All, A12, Helperl, NewSize, NewSize); 
StrassenNaivStep (Helperl, B22, Aux5, NewSize, m) ; 

Minus (A21, All, Helperl, NewSize, NewSize); 
Plus (Bll, B12, Helper2, NewSize, NewSize); 
StrassenNaivStep (Helperl, Helper2, Aux6, NewSize, m) ; 

Minus (A12, A22, Helperl, NewSize, NewSize); 
Plus(B21, B22, Helper2, NewSize, NewSize); 
StrassenNaivStep (Helperl, Helper2, Aux7, NewSize, m) ; 

// computing the four parts of the result 



Plus (Auxl, Aux4, ResultPartll, NewSize, NewSize); 

Minus (ResultPartll, Aux5, ResultPartll, NewSize, NewSize); 

Plus (ResultPartll, Aux7, ResultPartll, NewSize, NewSize); 
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Plus(Aux3, Aux5, ResultPartl2, NewSize, NewSize); 

Plus(Aux2, Aux4, ResultPart21, NewSize, NewSize); 

Plus(Auxl, Aux3, ResultPart22, NewSize, NewSize); 

Minus (ResultPart22, Aux2, ResultPart22, NewSize, NewSize); 

Plus (ResultPart22, Aux6, ResultPart22, NewSize, NewSize); 

// store results in the "result matrix" 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

Result [i] [ j] = ResultPartll [i] [ j] ; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

Result [i] [NewSize + j] = ResultPart 12 [ i ][ j ] ; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

Result [NewSize + i][j] = ResultPart 2 1 [ i ][ j ] ; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

Result [NewSize + i] [NewSize + j] = ResultPart22 [ i ] [ j ] ; 

} 

} 

// free helper variables 

free (All) ; 
free(A12) ; 
free(A21) ; 
free(A22) ; 

free(Bll) ; 
free(B12) ; 
free(B21) ; 
free(B2 2) ; 

free(ResultPartll) ; 
free(ResultPartl2) ; 
free(ResultPart21) ; 
free(ResultPart22) ; 

free (Helper 1 ) ; 
f ree (Helper2 ) ; 

free (Auxl ) ; 
free(Aux2) ; 
free(Aux3) ; 
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free(Aux4) ; 
free(Aux5) ; 
free (Aux6) ; 
free (Aux7 ) ; 

} else { 

// use naiv algorithm 
NaivStandard(A, B, Result, N, N, N) ; 



2.4.2. StrassenWinograd. In the paper Efficient Procedures for using Matrix Algorithms (see [2|) PATRICK 
C. Fischer and Robert L. Probert discussed an idea of Shmuel Winograd which reduces the 
number of used additions to 15 (instead of 18 in StrassenNaiv). 

Like above, we define a recursive algorithm a for multiplication of square matrices of order m2 fe+1 . 
Let A and B be matrices of this size. We assume that a is already known for the order m2 k . We define 
C := AB and decompose A, B and C according to Equation ([TJ. We define 

A\ := An — A 2 i Bi := B 2 2 — -B12 



and 



With 



we finally get 



A 2 := A 22 - Ai B 2 := Bi + B x 



Hi = AiiBn H 5 = A\B\ 

H 2 = A12-B21 Hq — (A12 — A 2 )B 2 2 

H3 = A2B2 H7 = A22(B2\ — B2) 

Hi = (A 21 +A 22 )(B 12 -B 11 ). 

Hs ■= Hi + H 3 Hg := H s + H4 

C\\ = Hi + H 2 C12 — Hq + Hq 

C21 = H s + H 5 + H-j C22 = H g + H 5 . 



void StrassenWinograd (A, B, Result, N, P, M) 
double** A; 
double** B; 
double** Result; 
int N; 
int P; 
int M; 



int MaxSize, k, m, NewSize, i, j; 

MaxSize = max(N,P); 
MaxSize = max (MaxSize, M) ; 

if ( MaxSize < 16) { 

MaxSize = 16; // otherivise it is not possible to compute k 

} 

k = floor (log (MaxSize) /log (2) ) - 4; 
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m = floor (MaxSize * pow(2,-k)) + 1; 
NewSize = m * pow(2,k); 

// add zero rows and columns to use Strassens algorithm 

double** NewA = malloc(NewSize*sizeof (double* )) ; 
double** NewB = malloc(NewSize*sizeof (double* )) ; 
double** AuxResult = malloc(NewSize*sizeof (double* )) ; 
for( i = 0; i < NewSize; i++) { 

NewA[i] = malloc(NewSize*sizeof (double) ) ; 

NewB[i] = malloc(NewSize*sizeof (double) ) ; 

AuxResult [i] = malloc(NewSize*sizeof (double) ) ; 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
NewA [ i ] [ j ] = 0.0; 
NewB [ i ] [ j ] = 0.0; 

} 

} 

for( i = 0; i < N; i++) { 
for ( j = 0; j < P; j++) { 
NewA[i] [ j] = A[i] [ j] ; 

} 

} 

for( i = 0; i < P; i++) { 
for ( j = 0; j < M; j++) { 
NewB[i] [ j] = B[i] [ j] ; 

} 

} 

StrassenWinogradStep (NewA, NewB, AuxResult, NewSize, m) ; 

// extract the result 

for( i = 0; i < N; i++) { 
for ( j = 0; j < M; j++) { 

Result [i] [j] = AuxResult [i] [ j ] ; 

} 

} 



void StrassenWinogradStep (A, B, Result, N, m) 
double** A; 
double** B; 
double** Result; 
int N; 

int m; //to reduce the recursion; idea from Strassen 



int i, j, NewSize; 

if ( (N % 2 == 0) && (N > m) ) { // recursive use of StrassenNaivStep 
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NewSize = N / 2; 



// decompose A and B 

// create ResultPart, Auxl Auxl and Helperl, Helper2 
double** All = malloc(NewSize*sizeof (double* 
double** A12 = malloc(NewSize*sizeof (double* 
double** A21 = malloc(NewSize*sizeof (double* 
double** A22 = malloc(NewSize*sizeof (double* 
double** Bll = malloc(NewSize*sizeof (double* 
double** B12 = malloc(NewSize*sizeof (double* 
double** B21 = malloc(NewSize*sizeof (double* 
double** B22 = malloc(NewSize*sizeof (double* 
double** Al = malloc(NewSize*sizeof (double* ) 
double** A2 = malloc(NewSize*sizeof (double* ) 
double** Bl = malloc(NewSize*sizeof (double* ) 
double** B2 = malloc(NewSize*sizeof (double* ) 
double** ResultPartll = malloc(NewSize*sizeof (double* ) ) 
double** ResultPartl2 = malloc(NewSize*sizeof (double* ) ) 
double** ResultPart21 = malloc(NewSize*sizeof (double* ) ) 
double** ResultPart22 = malloc(NewSize*sizeof (double* ) ) 
double** Helperl = malloc(NewSize*sizeof (double* )) ; 
double** Helper2 = malloc(NewSize*sizeof (double* )) ; 
double** Auxl = malloc(NewSize*sizeof (double* ) ) 
malloc(NewSize*sizeof (double* ) ) 
malloc(NewSize*sizeof (double* ) ) 
malloc(NewSize*sizeof (double* ) ) 
malloc(NewSize*sizeof (double* ) ) 
malloc(NewSize*sizeof (double* ) ) 
malloc(NewSize*sizeof (double* ) ) 
malloc(NewSize*sizeof (double* ) ) 
malloc(NewSize*sizeof (double* ) ) 
for( i = 0; i < NewSize; i++) { 

All [i] = malloc(NewSize*sizeof (double) ) 
A12 [i] = malloc(NewSize*sizeof (double) ) 
A21 [i] = malloc(NewSize*sizeof (double) ) 
A22[i] = malloc(NewSize*sizeof (double) ) 
Bll [i] = malloc(NewSize*sizeof (double) ) 
B12 [i] = malloc(NewSize*sizeof (double) ) 
B21 [i] = malloc(NewSize*sizeof (double) ) 
B22[i] = malloc(NewSize*sizeof (double) ) 
Al[i] = malloc(NewSize*sizeof (double) ) ; 
A2[i] = malloc(NewSize*sizeof (double) ) ; 
Bl[i] = malloc(NewSize*sizeof (double) ) ; 
B2[i] = malloc(NewSize*sizeof (double) ) ; 
ResultPart 1 1 [ i ] = malloc(NewSize*sizeof (double) ) ; 
ResultPartl2 [i] = malloc(NewSize*sizeof (double) ) ; 
ResultPart21 [i] = malloc(NewSize*sizeof (double) ) ; 
ResultPart22 [i] = malloc(NewSize*sizeof (double) ) ; 
Helperl [i] = malloc(NewSize*sizeof (double) ) ; 
Helper2[i] = malloc(NewSize*sizeof (double) ) ; 
Auxlfi] = malloc(NewSize*sizeof (double) ) 
Aux2[i] = malloc(NewSize*sizeof (double) ) 
Aux3[i] = malloc(NewSize*sizeof (double) ) 
Aux4[i] = malloc(NewSize*sizeof (double) ) 
Aux5[i] = malloc(NewSize*sizeof (double) ) 
Aux6[i] = malloc(NewSize*sizeof (double) ) 
Aux7[i] = malloc(NewSize*sizeof (double) ) 
Aux8[i] = malloc(NewSize*sizeof (double) ) 



double** Aux2 

double** Aux3 

double** Aux4 

double** Aux5 

double** Aux6 

double** Aux7 

double** Aux8 

double** Aux9 
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Aux9[i] = malloc(NewSize*sizeof (double) ) ; 

} 

// fill new matrices 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
All[i] [j] = A[i] [j]; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
A12[i] [ j] =A[i] [NewSize + j] ; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
A21 [i] [ j] = A [NewSize + i] [ j] ; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

A22[i][j] = A[NewSize + i] [NewSize + j]; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
Bll[i] [j] = B[i] [j]; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
B12[i][j] = B[i] [NewSize + j]; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 
B21[i] [j] = B[NewSize + i] [j]; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

B22 [i] [ j] = B[NewSize + i] [NewSize + j] ; 

} 

} 

// computing the 4+9 aux. variables 



Minus (All, A21, Al, NewSize, NewSize); 
Minus (A22, Al, A2 , NewSize, NewSize); 
Minus (B22, B12, Bl, NewSize, NewSize); 
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Plus(Bl, Bll, B2, NewSize, NewSize); 

StrassenWinogradStep (All, Bll, Auxl, NewSize, m) ; 
StrassenWinogradStep (A12, B21, Aux2, NewSize, m) ; 
StrassenWinogradStep (A2, B2, Aux3, NewSize, m) ; 
Plus(A21, A22, Helperl, NewSize, NewSize); 
Minus (B12, Bll, Helper2, NewSize, NewSize); 
StrassenWinogradStep (Helperl, Helper2, Aux4, NewSize, m) ; 
StrassenWinogradStep (Al, Bl, Aux5, NewSize, m) ; 
Minus (A12, A2, Helperl, NewSize, NewSize); 
StrassenWinogradStep (Helperl, B22, Aux6, NewSize, m) ; 
Minus (B21, B2, Helperl, NewSize, NewSize); 
StrassenWinogradStep (A22, Helperl, Aux7, NewSize, m) ; 
Plus (Auxl, Aux3, Aux8, NewSize, NewSize); 
Plus(Aux8, Aux4, Aux9, NewSize, NewSize); 

// computing the four parts of the result 

Plus (Auxl, Aux2, ResultPartll, NewSize, NewSize); 

Plus(Aux9, Aux6, ResultPartl2, NewSize, NewSize); 

Plus(Aux8, Aux5, Helperl, NewSize, NewSize); 

Plus (Helperl, Aux7, ResultPart21, NewSize, NewSize); 

Plus(Aux9, Aux5, ResultPart22, NewSize, NewSize); 

// store results in the "result matrix" 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

Result [i] [j] = ResultPartll [i] [j] ; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

Result [i] [NewSize + j] = ResultPart 12 [ i ] [ j ] ; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

Result [NewSize + i][j] = ResultPart21 [i] [ j ] ; 

} 

} 

for( i = 0; i < NewSize; i++) { 
for( j = 0; j < NewSize; j++) { 

Result [NewSize + i] [NewSize + j] = ResultPart22 [ i ] [ j ] ; 

} 

} 

// free helper variables 



free (All) 
free(A12) 
free(A21) 
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free(A22) ; 

free(Bll) ; 
free(B12) ; 
free(B21) ; 
free(B2 2) ; 

free (Al ) ; 
free (A2 ) ; 
free(Bl) ; 
free(B2) ; 

free(ResultPartll) ; 
free(ResultPartl2) ; 
free(ResultPart21) ; 
free(ResultPart22) ; 

free(Helperl) ; 
free(Helper2) ; 

free (Auxl ) ; 
free (Aux2 ) ; 
free(Aux3) ; 
free (Aux4 ) ; 
free(Aux5) ; 
free(Aux6) ; 
free (Aux7 ) ; 
free(Aux8) ; 
free(Aux9) ; 

} else { 

// use naiv algorithm 

NaivStandard (A, B, Result, N, N, N) ; 



} 
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3. Tests 



To test the algorithms we compute the product of a random square matrix A with its scaled variant 
B := 8A. The test routine is part of the package of C files. Is has the command line options 

• -0 matrix size, standard: 400 

• -R number of repeats, standard: 10 

+ 
I 
I 



C = A* (8A) 


TIME TEST FOR 
, where A is a 


METHODS 
(n x n) 


OF MATRIX MULTIPLICATION 
random matrix with n = 200 


method 




1 


time (sec) | Normlnf ( N-C ) 



N := NaivKahan 





146027 






NaivStandard 





083263 





0000000000 


NaivOnArray | 





121735 





0000000000 


NaivLoopUnrollingTwo 





072981 





0000000000 


NaivLoopUnrollingThree 





068356 





0000000000 


NaivLoopUnrollingFour 





067080 





0000000000 


StrassenNaiv 





077584 





0000000001 


St rassenWinograd 





080180 





0000000000 


WinogradOriginal 





073193 





0000000001 


WinogradScaled 





061350 





0000000001 



+- 



TIME TEST FOR METHODS OF MATRIX MULTIPLICATION 

random matrix with n = 



C = A*(8A), where A is a (n x n) 

+ 

I method 



800 



-+ 

I 
I 
I 

-+ 

I 

-+ 

I 
I 
I 
I 
I 
I 
I 
I 
I 
I 

-+ 



time (sec) 



Normlnf ( N-C ) 



N := NaivKahan 


11 


817480 | 






NaivStandard 


7 


370365 | 





0000000009 


NaivOnArray | 


10 


828352 | 





0000000009 


NaivLoopUnrollingTwo 


6 


914036 | 





0000000006 


NaivLoopUnrollingThree 


6 


605404 | 





0000000005 


NaivLoopUnrollingFour 


6 


480659 | 





0000000004 


StrassenNaiv 


3 


806864 | 





0000000022 


St rassenWinograd 


3 


978543 | 





0000000010 


WinogradOriginal 


6 


892913 | 





0000000036 


WinogradScaled 


5 


781587 | 





0000000022 
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